Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge, or at a low cost, which benefits hotel guests but is a less desirable and potentially revenue-diminishing factor for hotels. Losses are particularly high for last-minute cancellations.

New technologies, particularly online booking channels, have dramatically changed customers' booking options and behavior. This adds a further dimension to the challenge of handling cancellations, which are no longer driven solely by traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

  1. Loss of resources (revenue) when the hotel cannot resell the room.
  2. Additional distribution-channel costs: higher commissions or paid publicity to help resell these rooms.
  3. Last-minute price cuts to resell the room, which reduce the profit margin.
  4. Human resources spent making arrangements for the guests.

Objective

The increasing number of cancellations calls for a machine-learning solution that can predict which bookings are likely to be cancelled. Star Hotels Group, a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for a data-driven solution. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can flag likely cancellations in advance, and help formulate profitable cancellation and refund policies.

Key question

Which factors have a high influence on booking cancellations?

Dataset

Import the necessary packages
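One possible set of imports for this kind of analysis (the exact libraries used in the original notebook are not shown, so this is an assumption covering the data handling, plotting, and modeling steps that follow):

```python
# Data handling, visualization, and modeling libraries assumed for this analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
```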

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.
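These first checks can be sketched as follows. The frame below is a tiny illustrative stand-in for the bookings data; the real file name and column names are assumptions, not given in the brief:

```python
import pandas as pd

# Illustrative stand-in for the bookings data (hypothetical columns).
data = pd.DataFrame({
    "lead_time": [24, 5, 211, 48, 90, 13],
    "avg_price_per_room": [75.0, 98.5, 110.0, 60.0, 120.0, 85.0],
    "booking_status": [0, 1, 1, 0, 1, 0],
})

print(data.head())   # first 5 rows
print(data.tail())   # last 5 rows
print(data.shape)    # (number of rows, number of columns)
```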

Let's check for duplicate rows and remove any that we find.

Let's drop the duplicate values
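A minimal sketch of the duplicate check and removal, on a hypothetical frame with one exact duplicate row:

```python
import pandas as pd

# Hypothetical frame containing one exact duplicate row.
df = pd.DataFrame({
    "lead_time": [24, 24, 90],
    "avg_price_per_room": [75.0, 75.0, 120.0],
})

n_duplicates = df.duplicated().sum()              # rows identical to an earlier row
df = df.drop_duplicates().reset_index(drop=True)  # keep the first occurrence
print(n_duplicates, df.shape)
```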

Get the information about the dataset

Insights:

Check for missing values

Summary of the dataset

Observations-

EDA

Univariate analysis

Observations

Observations

Bivariate Analysis

Observation

Summary of EDA

Data Description:

Observations from EDA:

Data Preparation

Converting the categorical variables into numeric form
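One common way to do this conversion is one-hot encoding with `pd.get_dummies`; the column and level names below are assumptions standing in for the real fields such as market_segment_type:

```python
import pandas as pd

# Hypothetical categorical column with several levels.
df = pd.DataFrame({"market_segment_type": ["Online", "Offline", "Online", "Corporate"]})

# One-hot encode; drop_first avoids perfect multicollinearity among the dummies.
encoded = pd.get_dummies(df, columns=["market_segment_type"], drop_first=True)
print(encoded.columns.tolist())
```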

Split Data into test and train dataset.
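A sketch of the split using scikit-learn; the 70/30 proportion and the random seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and binary target standing in for the bookings data.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 70/30 split; stratify keeps the cancellation rate equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```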

Logistic Regression using the sklearn library model

The confusion matrix using Logistic Regression
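The model fit and confusion matrix can be sketched on synthetic data (the real features and hyperparameters are not shown in the brief):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic data: the target is loosely driven by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)
pred = model.predict(X)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y, pred)
print(cm)
```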

Model evaluation criterion

The model can make two kinds of wrong predictions:

  1. Predicting a customer will not cancel the booking when in reality they do cancel (a false negative) - loss of resources (revenue).

  2. Predicting a customer will cancel the booking when in reality they do not cancel (a false positive) - potential loss of future business and reputation.

Which case is more important?

A false negative, i.e. a cancellation the model fails to predict, is the costlier error because it leads to:

  1. Loss of resources (revenue) when the hotel cannot resell the room.
  2. Additional distribution-channel costs: higher commissions or paid publicity to help resell these rooms.
  3. Last-minute price cuts to resell the room, which reduce the profit margin.
  4. Human resources spent making arrangements for the guests.

How do we reduce this loss, i.e., how do we reduce false negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
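A minimal version of such helper functions might look like this (the function names and the exact set of metrics are assumptions; recall is included because it is the metric we want to maximize):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def model_performance(model, X, y):
    """Return accuracy, recall, precision, and F1 for a fitted classifier."""
    pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y, pred),
        "recall": recall_score(y, pred),
        "precision": precision_score(y, pred),
        "f1": f1_score(y, pred),
    }

def show_confusion_matrix(model, X, y):
    """Return the confusion matrix (rows = actual, columns = predicted)."""
    return confusion_matrix(y, model.predict(X))

# Quick demo on a toy, clearly separable dataset (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.2], [0.8], [1.0]] * 5)
y = np.array([0, 0, 1, 1] * 5)
clf = LogisticRegression().fit(X, y)
perf = model_performance(clf, X, y)
print(perf)
```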

Checking performance on training set

Checking performance on test set

Observations

This shows that the model is giving generalised results.

We have built a logistic regression model that shows good performance on the train and test sets, but to identify significant variables we will build a logistic regression model using the statsmodels library. statsmodels is a Python module that provides functions for estimating many statistical models, as well as for conducting statistical tests and statistical data exploration.

Using statsmodels, we will be able to check the statistical validity of our model - identify the significant predictors from p-values that we get for each predictor variable.

Logistic Regression (with statsmodels library)

Observations

Additional Information on VIF

Variance Inflation Factor: variance inflation factors measure the inflation in the variances of the regression coefficient estimates due to collinearity that exists among the predictors. The VIF is a measure of how much the variance of the estimated regression coefficient β̂k is "inflated" by the existence of correlation among the predictor variables in the model.

General rule of thumb: if the VIF is 1, there is no correlation between the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. A VIF exceeding 5 indicates moderate multicollinearity, and a VIF of 10 or more indicates high multicollinearity. But the purpose of the analysis should dictate which threshold to use.
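VIFs can be computed with statsmodels; the example below uses hypothetical columns, including one deliberately correlated with another to show an elevated VIF:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "lead_time": rng.normal(size=100),
    "avg_price_per_room": rng.normal(size=100),
})
# A column strongly correlated with lead_time, to illustrate inflation.
X["lead_time_noisy"] = X["lead_time"] + 0.1 * rng.normal(size=100)

# VIF for column i comes from regressing it on all the other columns.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```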

None of the variables exhibit high multicollinearity, so the values in the summary are reliable.

Observation

Coefficient interpretations

Converting coefficients to odds
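The conversion is simply exponentiation; the coefficient values below are hypothetical, not the model's actual estimates:

```python
import numpy as np

# Hypothetical fitted coefficients from a logit model.
coefs = {"lead_time": 0.016, "no_of_special_requests": -0.87}

# exp(beta) is the multiplicative change in the odds of cancellation
# for a one-unit increase in the predictor, holding the rest constant.
odds = {name: np.exp(b) for name, b in coefs.items()}
print(odds)
```

An odds ratio above 1 means the predictor increases the odds of cancellation; below 1 means it decreases them.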

Coefficient interpretations

Checking model performance on the training set

ROC-AUC

ROC-AUC on training set
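The AUC computation can be sketched with toy labels and predicted probabilities (the actual model outputs are not reproduced here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted cancellation probabilities.
y_train = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# AUC = probability a random positive is ranked above a random negative.
auc = roc_auc_score(y_train, proba)
print(round(auc, 3))
```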

Logistic Regression model is giving a good performance on training set.

Model Performance Improvement

Let's see if the f1 score can be improved further, by changing the model threshold using AUC-ROC Curve

Optimal threshold using AUC-ROC curve
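One common way to pick the threshold from the ROC curve is to maximize Youden's J statistic (TPR minus FPR); the data below is a toy example:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_train = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_train, proba)
# Youden's J = TPR - FPR; the maximizing threshold balances the two rates.
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
print(optimal_threshold)
```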

Checking model performance on training set

Let's use Precision-Recall curve and see if we can find a better threshold
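A sketch of finding a balanced threshold from the precision-recall curve, again on toy data; choosing the point where precision and recall are closest is one common heuristic:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_train = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

precision, recall, thresholds = precision_recall_curve(y_train, proba)
# precision/recall have one more entry than thresholds; drop the last point.
balanced_idx = np.argmin(np.abs(precision[:-1] - recall[:-1]))
balanced_threshold = thresholds[balanced_idx]
print(balanced_threshold)
```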

At a threshold of around 0.41 we get balanced recall and precision.

Checking model performance on training set

The model performs reasonably on the training set; the optimal threshold of 0.41 gives little improvement over the default threshold.

Model Performance Summary

Let's check the performance on the test set

Using model with default threshold

ROC curve on test set

Using model with optimal threshold

Using model with threshold = 0.41

Model performance summary

Conclusion

Recommendations

Build Decision Tree Model and visualizing the Decision Tree

Scoring our Decision Tree

According to the decision tree model, lead_time is the most important variable for predicting booking cancellations. The tree above is very complex, and such a tree often overfits.
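A sketch of fitting an unconstrained tree and reading its feature importances; the data is synthetic, with the target driven by lead_time to mirror the finding above (visualizing the fitted tree would use `sklearn.tree.plot_tree` with matplotlib):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "lead_time": rng.integers(0, 400, size=200),
    "avg_price_per_room": rng.normal(100, 20, size=200),
})
# Synthetic target driven by lead_time (illustrative assumption).
y = (X["lead_time"] > 150).astype(int)

tree = DecisionTreeClassifier(random_state=1).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```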

Reducing overfitting

In general, the deeper you allow your tree to grow, the more complex the model becomes: more splits capture more information about the training data, and this is one of the root causes of overfitting. Let's try limiting the max_depth of the tree to 3.

Decision Tree (Pre-pruning with max depth 3)
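Pre-pruning amounts to passing `max_depth` at construction time; sketched here on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Pre-pruning: cap the depth so the tree cannot memorize the training data.
pruned = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
print(pruned.get_depth(), pruned.get_n_leaves())
```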

Confusion Matrix - decision tree with depth restricted to 3

Recall on the training set has dropped from 0.99 to 0.75. Although lower, this is an improvement in the sense that the model is no longer overfitting, and we now have a generalized model.

The tree has become readable now but the recall on test set has not improved.

Decision tree with max depth 5

Since the recall value has gone down compared to the pre-pruned tree with max depth 3, we will not consider this model.

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Total impurity of leaves vs effective alphas of pruned tree

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
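The procedure described above can be sketched as follows on synthetic data, using scikit-learn's `cost_complexity_pruning_path`:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Effective alphas and total leaf impurities along the pruning path.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# Fit one tree per alpha; drop the last alpha, which prunes to a single node.
clfs = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y)
    for a in ccp_alphas[:-1]
]
node_counts = [clf.tree_.node_count for clf in clfs]
print(node_counts[0], node_counts[-1])  # node count shrinks as alpha grows
```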

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Comparing all the decision tree models

Conclusion

(a) If lead_time > 9.5 and lead_time <= 150.50, no_of_special_requests <= 0.50, and market_segment_type_1 > 0.50, the guest is more likely to cancel.

(b) If lead_time > 150.50 but avg_price_per_room <= 100.04, the guest has a higher chance of cancelling.

(c) Similarly, if lead_time > 150.50 and avg_price_per_room > 100.04, but the guest has fewer than 2.5 special requests, the guest is very likely to cancel.

Recommendations

According to the decision tree model:

a) Lead time is one of the most important variables for determining the cancellation odds, i.e. whether the customer will cancel the booking.

b) Market segment type is another important field for determining the customer cancellation trend.

c) Average price per room is indicated by more than one tree model as an important field for determining the customer cancellation trend.

d) Number of special requests is also a factor in determining the customer cancellation trend.

Lead time is a key indicator. Star Hotels can run marketing schemes that incentivize customers to book in advance, pair bookings with special offers and requests to lock them in, and engage customers through its channels so they are more engaged and less likely to cancel.

It is observed that 36% of guests had special requests, and these guests were less likely to cancel their bookings. This shows a clear trend: customers who plan their stay in detail are less likely to cancel. Star Hotels can engage with customers to collect more of this information, resulting in deeper engagement.

The share of repeat guests is low (3.1%), but they were the least likely to cancel. Star Hotels should incentivize repeat guests with loyalty schemes, as this may further improve the non-cancellation trend.